This try to be a parcial market study about hospitality sector in Barcelona's city. It's also included the hostelry and recreational bussiness like gymns, in order to support the auxiliar service and utilities industry that supplay to those mentioned main industries.
For instance, catering companies that work serving to hotels or restaurants, cleaning and desinfection companies, laundries, kitchenware suppliers, fresh food, etc.. all the companies that give services or supply goods for hotels, hostals, restaurants, and/or gyms in Barcelona
All the mentioned services's and utilities companies above need to know where fisicaly their possible objetive clients are. Also, they need to create areas dividing the city in order to give the best posible service for their clients. That's why, this studie try to use the known data bases of Barcelona to divide the city in areas and get a possible location for their warehouse or centralitation office for each area.
It is possible to use K-Mean clustering to divide the city in areas or groups because K-Mean try to group the elements with the less posible distance between elements of the group but getting the highest distance between each group. Furthermore, the K-Mean method provide the centroids, it means, a center point of the group where idealy can be localized an operational centralitation like warehouse or office to operate each group. Instead of use an Euclidean space, we can use longitude and latitude as a Cartesian space for K-Means.
https://opendata-ajuntament.barcelona.cat/
In the link above it's posible to find a large source of oficial information from Barcelon's Council. Thanks to this open data availability , it's why I decided to focuse the study in Barcelona's city. The dataset provides the information in wide range of formats. The acaccessibility can be directly through the links in the web or through their different API's. A better explanation of the API management can be found on the link below :
https://opendata-ajuntament.barcelona.cat/en/desenvolupadors
Because in the council dataset, the information about potential recreational business like gyms are mixed with other sectors and the information is not easy to handle, in the study will be used Foursquare dataset to get the closed gyms to the centroids based on their distance. To get it, the Python Foursquare pakage will be used to get an easy access.
python version : 3.8.6 pandas version : 1.1.4 numpy version : 1.19.3 folium version : 0.11.0 matplotlib version : 3.3.2 foursquare version : 1!2020.1.30 sklearn version : 0.23.2 geopy version : 2.0.0
Please, for further information, see the link below to get an explanation of how to get the information from Barcelona's open data repository. For this study it's intreating the information of gesthouses, hostels, hotels and restaurants. That information is available in different formats, in this case, the CVS format will be selected. The code will generate a pandas dataframe for each table and print the name of the dataframe and its shape to get an idea about how long the table it's.
https://opendata-ajuntament.barcelona.cat/en/desenvolupadors
DataFrame guesthouse shape : (580, 33) DataFrame hostel shape : (564, 33) DataFrame hotel shape : (884, 33) DataFrame restaurants shape : (16936, 33) DataFrame Neighborhood_BCN shape : (73, 4)
"guesthouse", "hostel" and "hotel" tables contain the same structure so all of them can be transformed and join in a single dataframe
A "Category" column is created to define the diferent type of business to keep that information after being joined
In "...NIVELL" columns it's posible to find which category of accommodation it is. For instance, "3ER_NIVELL" column contains the hotel category stars, "2N_NIVELL" column contains if the guesthouse is an apartment , students residency, young hostel or other type. For hostels there isn't any farther information. What it's propoused is to map that information with numbers from 10 to 0. On one hand that information can be categorized, making luxury hotels more interesting than young hostels.On the other hand, it's better for an easier access in further analysis.
Definitions :
- For hotels, stars +10 --> from 0 starts to 5 starts = 5 to 10.
- hostels =4
- guesthouse;
- Apartments =3
- student housing and University residences =2
- young hostel = 1
- not defined categories = 0
Now the tables are joined in one dataframe called 'accommodation'
accommodations dataframe
| CODI_EQUIPAMENT | EQUIPAMENT | SECCIO | TIPUS_VIA | NOM_CARRER | NUM_CARRER_1 | NUM_CARRER_2 | CODI_BARRI | NUM_BARRI | CODI_DISTRICTE | ... | HORARI_DIES | HORARI_HORES_INICI | HORARI_HORES_FI | HORARI_OBSERVACIONS | 3ER_NIVELL | 2N_NIVELL | 1ER_NIVELL | Unnamed: 32 | CATEGORY | LEVEL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 146175914 | Residència Università ria Lesseps | # | Pl | Lesseps | 12 | 12.0 | 31 | la Vila de Grà cia | 6 | ... | NaN | NaN | NaN | NaN | NaN | Residències d'estudiants | Allotjament | NaN | guesthouse | 2 |
| 1 | 146175914 | Residència Università ria Lesseps | # | Pl | Lesseps | 12 | 12.0 | 31 | la Vila de Grà cia | 6 | ... | NaN | NaN | NaN | NaN | NaN | Residències d'estudiants | Allotjament | NaN | guesthouse | 2 |
| 2 | 146175914 | Residència Università ria Lesseps | # | Pl | Lesseps | 12 | 12.0 | 31 | la Vila de Grà cia | 6 | ... | NaN | NaN | NaN | NaN | NaN | Residències d'estudiants | Allotjament | NaN | guesthouse | 2 |
3 rows × 35 columns
Unnecessary columns and possible dupicated lines are droped
Columns before cleaning : Index(['CODI_EQUIPAMENT', 'EQUIPAMENT', 'SECCIO', 'TIPUS_VIA', 'NOM_CARRER',
'NUM_CARRER_1', 'NUM_CARRER_2', 'CODI_BARRI', 'NUM_BARRI',
'CODI_DISTRICTE', 'NOM_DISTRICTE', 'CODI_POSTAL', 'CODI_POBLACIO',
'POBLACIO', 'LATITUD', 'LONGITUD', 'X_ETRS89', 'Y_ETRS89', 'X_ED50',
'Y_ED50', 'TELEFON_NUM', 'TELEFON_TIPUS', 'TELEFON_INFO_COM',
'HORARI_PERIODE_INICI', 'HORARI_PERIODE_FI', 'HORARI_DIES',
'HORARI_HORES_INICI', 'HORARI_HORES_FI', 'HORARI_OBSERVACIONS',
'3ER_NIVELL', '2N_NIVELL', '1ER_NIVELL', 'Unnamed: 32', 'CATEGORY',
'LEVEL'],
dtype='object')
Shape : (2028, 35)
Columns after cleaning : Index(['CODI_EQUIPAMENT', 'EQUIPAMENT', 'CODI_BARRI', 'NUM_BARRI',
'CODI_DISTRICTE', 'NOM_DISTRICTE', 'CODI_POSTAL', 'LATITUD', 'LONGITUD',
'CATEGORY', 'LEVEL'],
dtype='object')
Shape : (1031, 11)
And print the dataframe
accommodations dataframe (1031, 11)
| CODI_EQUIPAMENT | EQUIPAMENT | CODI_BARRI | NUM_BARRI | CODI_DISTRICTE | NOM_DISTRICTE | CODI_POSTAL | LATITUD | LONGITUD | CATEGORY | LEVEL | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 146175914 | Residència Università ria Lesseps | 31 | la Vila de Grà cia | 6 | Grà cia | 8023 | 41.407168 | 2.151154 | guesthouse | 2 |
| 3 | 146180329 | Residència Università ria La Ciutadella | 66 | el Parc i la Llacuna del Poblenou | 10 | Sant Martà | 8018 | 41.391563 | 2.186115 | guesthouse | 2 |
| 6 | 75990016464 | Col.legi Major Universitari Lestonnac | 7 | la Dreta de Eixample | 2 | Eixample | 8007 | 41.392634 | 2.165757 | guesthouse | 2 |
Drop duplicated lines unnecessary columns and print
Restaurants columns : Index(['CODI_EQUIPAMENT', 'EQUIPAMENT', 'SECCIO', 'TIPUS_VIA', 'NOM_CARRER',
'NUM_CARRER_1', 'NUM_CARRER_2', 'CODI_BARRI', 'NUM_BARRI',
'CODI_DISTRICTE', 'NOM_DISTRICTE', 'CODI_POSTAL', 'CODI_POBLACIO',
'POBLACIO', 'LATITUD', 'LONGITUD', 'X_ETRS89', 'Y_ETRS89', 'X_ED50',
'Y_ED50', 'TELEFON_NUM', 'TELEFON_TIPUS', 'TELEFON_INFO_COM',
'HORARI_PERIODE_INICI', 'HORARI_PERIODE_FI', 'HORARI_DIES',
'HORARI_HORES_INICI', 'HORARI_HORES_FI', 'HORARI_OBSERVACIONS',
'3ER_NIVELL', '2N_NIVELL', '1ER_NIVELL', 'Unnamed: 32'],
dtype='object')
Shape : (16909, 33)
Restaurants columns after cleaning : Index(['CODI_EQUIPAMENT', 'EQUIPAMENT', 'CODI_BARRI', 'NUM_BARRI',
'CODI_DISTRICTE', 'NOM_DISTRICTE', 'CODI_POSTAL', 'LATITUD',
'LONGITUD'],
dtype='object')
Shape : (2469, 9)
And print the first lines of the table
restaurants dataframe
| CODI_EQUIPAMENT | EQUIPAMENT | CODI_BARRI | NUM_BARRI | CODI_DISTRICTE | NOM_DISTRICTE | CODI_POSTAL | LATITUD | LONGITUD | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 53170444 | Restaurant Tapes Tapas *Poble Espanyol | 11 | el Poble-sec | 3 | Sants-Montjuïc | 8038 | 41.369148 | 2.146751 |
| 9 | 94135210 | Restaurant Japonès Ginza | 8 | l'Antiga Esquerra de l'Eixample | 2 | Eixample | 8008 | 41.392182 | 2.157920 |
| 15 | 140145849 | Hotel RÃvoli Ramblas - HB-003878 | 2 | el Barri Gòtic | 1 | Ciutat Vella | 8002 | 41.384315 | 2.171183 |
For display the information on the map, it's used folium technology
In this map we see all the restaurants and accomodations in the city. In the control layer can be selected the differnt layers and the marks has been clustered for a clear visualization.
Let's see the information in a choropleth map. In one map are included the accomodations and it's density by neighborhoods. The next one contain the restaurants and density by neighborhoods, also. For that, it's prepared a table grouping the elements by neighborhood.
As we can see, most of the restaurans and accomodations are concentred in two or three neighborhood arround the city center:
| Neighb_code | Accom_quantity | District_code | District_name | Neighb_Name | |
|---|---|---|---|---|---|
| 6 | 7 | 253 | 2 | Eixample | la Dreta de l'Eixample |
| 1 | 2 | 112 | 1 | Ciutat Vella | el Barri Gòtic |
| 7 | 8 | 91 | 2 | Eixample | l'Antiga Esquerra de l'Eixample |
| 0 | 1 | 86 | 1 | Ciutat Vella | el Raval |
| 30 | 31 | 40 | 6 | Grà cia | la Vila de Grà cia |
| Neighb_code | Restaur_quantity | District_code | District_name | Neighb_Name | |
|---|---|---|---|---|---|
| 6 | 7 | 291 | 2 | Eixample | la Dreta de l'Eixample |
| 7 | 8 | 265 | 2 | Eixample | l'Antiga Esquerra de l'Eixample |
| 1 | 2 | 159 | 1 | Ciutat Vella | el Barri Gòtic |
| 30 | 31 | 159 | 6 | Grà cia | la Vila de Grà cia |
| 25 | 26 | 151 | 5 | Sarrià -Sant Gervasi | Sant Gervasi - Galvany |
Those companies which doesn't want to provide services to all of the possible clients (restaurants and accomodations) in all over the city, better will be concrentrated on 'Eixample' and 'Ciutat Vella' districts.
For this task is possible to use KMeans. It will divide the clients in the number of groups that we want and provide the centroids where to locate the workin-centers. Each group will optimize the resources
First of all it's necesary to determine the number of areas. So we can use the Inertia to evaluate wich number of groups will be better. let's see what is the inertia dividing the city between 2 to 8 groups for accomodations, and the same for restaurants:
GROUPS OF ACCOMMODATIONS inertia for 2 groups : 0.251 inertia for 3 groups : 0.185 inertia for 4 groups : 0.142 inertia for 5 groups : 0.117 inertia for 6 groups : 0.1 inertia for 7 groups : 0.087 inertia for 8 groups : 0.074 variance between 2 and 3 groups : 0.066 variance between 3 and 4 groups : 0.043 variance between 4 and 5 groups : 0.025 variance between 5 and 6 groups : 0.016 variance between 6 and 7 groups : 0.014 variance between 7 and 8 groups : 0.012 ---------------------------------------------- GROUPS OF RESTAURANTS inertia for 2 groups : 0.829 inertia for 3 groups : 0.577 inertia for 4 groups : 0.447 inertia for 5 groups : 0.364 inertia for 6 groups : 0.298 inertia for 7 groups : 0.255 inertia for 8 groups : 0.22 variance between 2 and 3 groups : 0.253 variance between 3 and 4 groups : 0.13 variance between 4 and 5 groups : 0.083 variance between 5 and 6 groups : 0.066 variance between 6 and 7 groups : 0.043 variance between 7 and 8 groups : 0.035 ---------------------------------------------- GROUPS OF ACCOMODATIONS AND RESTAURANTS TOGETHER inertia for 2 groups : 1.107 inertia for 3 groups : 0.767 inertia for 4 groups : 0.597 inertia for 5 groups : 0.493 inertia for 6 groups : 0.413 inertia for 7 groups : 0.354 inertia for 8 groups : 0.304 variance between 2 and 3 groups : 0.253 variance between 3 and 4 groups : 0.13 variance between 4 and 5 groups : 0.083 variance between 5 and 6 groups : 0.066 variance between 6 and 7 groups : 0.043 variance between 7 and 8 groups : 0.035
We can observe that applying KMean on lower quantity of elements, generate lower inertia. It means, considering only the accomodations will generate less inertia than the restaurants or all together because there are less elements. However, the sum of the individual inertias is higher than the inertia of all together .
So we will consider all the elements together and not separete restaurants and hotelsSecondly, we observe that as higher is the number of the groups, lower is the inertia (something logical). But the difference betwbetween 7 to 8 groups it's very low and considering that as more groups we do, less resources we have because the resorcess should be distributed by groups. It's necesary to find the less groups with a better performance.
For that reason, we will considere a division in 6 groups
Let's apply the kMeans analysis for all elements in 6 groups and generatea a table including all the elements and a new column with the number of the group to which they belong to.
all elements with assigned area (3500, 12)
| CODI_EQUIPAMENT | EQUIPAMENT | CODI_BARRI | NUM_BARRI | CODI_DISTRICTE | NOM_DISTRICTE | CODI_POSTAL | LATITUD | LONGITUD | CATEGORY | LEVEL | GROUP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 146175914 | Residència Università ria Lesseps | 31 | la Vila de Grà cia | 6 | Grà cia | 8023 | 41.407168 | 2.151154 | guesthouse | 2.0 | 4 |
| 1 | 146180329 | Residència Università ria La Ciutadella | 66 | el Parc i la Llacuna del Poblenou | 10 | Sant Martà | 8018 | 41.391563 | 2.186115 | guesthouse | 2.0 | 0 |
| 2 | 75990016464 | Col.legi Major Universitari Lestonnac | 7 | la Dreta de Eixample | 2 | Eixample | 8007 | 41.392634 | 2.165757 | guesthouse | 2.0 | 4 |
| 3 | 75990020227 | Residencia Madre Isabel Larrañaga | 24 | les Tres Torres | 5 | Sarrià -Sant Gervasi | 8017 | 41.397669 | 2.128852 | guesthouse | 2.0 | 2 |
| 4 | 75990020841 | Residencia e Institución Javeriana | 4 | Sant Pere, Santa Caterina i la Ribera | 1 | Ciutat Vella | 8003 | 41.385713 | 2.177711 | guesthouse | 2.0 | 0 |
Now we can divide that long dataframe in 6 dataframes one for each group:
shape of dataframe "group_0" : (1117, 12) shape of dataframe "group_1" : (452, 12) shape of dataframe "group_2" : (517, 12) shape of dataframe "group_3" : (184, 12) shape of dataframe "group_4" : (974, 12) shape of dataframe "group_5" : (256, 12)
à Voilà !! Here we can see all the restaurants and accomodations of Barcelona clustered in 6 groups by K-Mean.
And the centroids, with their locations
centroid for group 0 : 41.38549036021505, 2.1756782096774194 location: 4, Carrer del Doctor Joaquim Pou, el Gòtic, Ciutat Vella, Barcelona, Barcelonès, Catalunya, 08002, España centroid for group 1 : 41.37808499559471, 2.156656528634361 location: 65, Carrer de Calà bria, Project Area Sant Antoni, Font Trobada, Barcelona, Barcelonès, Catalunya, 08015, España centroid for group 2 : 41.39092840038685, 2.1329128065764023 location: 10, Carrer de la Caravel·la La Niña, les Corts, Barcelona, Barcelonès, Catalunya, 08017, España centroid for group 3 : 41.42713067391304, 2.1785231576086956 location: 4, Carrer de Sant Pasqual Bailón, el Congrés, el Congrés i els Indians, Sant Andreu, Barcelona, Barcelonès, Catalunya, 08027, España centroid for group 4 : 41.3966167995889, 2.158021282631038 location: Església i convent de Pompeia, Carrer de la Riera de Sant Miquel, la Vila de Grà cia, Grà cia, Barcelona, Barcelonès, Catalunya, 08001, España centroid for group 5 : 41.4009648828125, 2.19924386328125 location: Schindler. Liberty Seguros. Regal, Carrer de Pujades, el Poblenou, Sant MartÃ, Barcelona, Barcelonès, Catalunya, 08001, España
As commented in the introduction of the article , we can include the gyms, too. Because they have some similar requirements of providers and services as accomodations or restaurants. Information about gyms is possible to be found in Barcelona's open data repository. But it's mixed with other business, so easily it can be provided by Foursquare. For this process it's used the Foursquare's API which can found in pip repository.
<foursquare.Foursquare at 0x1b27940cac0>
Now we can get the closest gyms to the centroids as possible clients to be study. Due to the limitations of Foursquare , a limit of 50 items is settled for each group and a radious of 1.5km arroun the center point. The result are achieved in separated dataframes for each group
DataFrame Gyms_group_0 shape : (49, 8) DataFrame Gyms_group_1 shape : (50, 8) DataFrame Gyms_group_2 shape : (48, 8) DataFrame Gyms_group_3 shape : (44, 8) DataFrame Gyms_group_4 shape : (49, 8) DataFrame Gyms_group_5 shape : (49, 8)
Below we can see two examples of the dataframes:
DataFrame Gyms_group_5
| NAME | LOCATION | LATITUD | LONGITUD | DISTANCE_CENTOID | ZIP | NEIGHBORHOOD | GROUP | |
|---|---|---|---|---|---|---|---|---|
| 0 | Àgora (Psicoterapia y Desenvolvimiento Humano) | Calle Pujades | 41.399024 | 2.197441 | 263 | 08005 | NaN | 5 |
| 1 | Barcelona Martial Arts Academy (BCNMA) | Roc Boronat, 66 | 41.400230 | 2.198661 | 95 | NaN | NaN | 5 |
| 2 | Corporació Fisiogestion | Carrer de la Ciutat de Granada, 115 | 41.401710 | 2.193483 | 488 | 08018 | NaN | 5 |
| 3 | Feeling Woman | Pere IV 29-35, 3º 5ª | 41.394764 | 2.190814 | 985 | 08018 | NaN | 5 |
| 4 | Duet Fit | 208 Avinguda Diagonal | 41.404581 | 2.191114 | 789 | 08018 | NaN | 5 |
DataFrame Gyms_group_3
| NAME | LOCATION | LATITUD | LONGITUD | DISTANCE_CENTOID | ZIP | NEIGHBORHOOD | GROUP | |
|---|---|---|---|---|---|---|---|---|
| 0 | Duet Sports Coxteres Borbó (Oficial). | C/ Ramón Albó, 41-57 | 41.426602 | 2.178902 | 66 | 08027 | NaN | 3 |
| 1 | Gym Guinardó | Carrer del Telègraf, 33-65 | 41.416792 | 2.173619 | 1221 | 08041 | NaN | 3 |
| 2 | Holmes Place | Avinguda de Rio de Janeiro, 42 | 41.435380 | 2.180947 | 940 | 08016 | NaN | 3 |
| 3 | Crossfit Sant Andreu | 71 Calle del Segre | 41.431867 | 2.194215 | 1411 | 08030 | NaN | 3 |
| 4 | Club Natació Sant Andreu | Rambla Fabra i Puig 45-47 | 41.429418 | 2.186491 | 712 | 08030 | NaN | 3 |
Let's see the information on a map:
As we can see, K-Mean can be used also to distribute a big group of phisical points in the space in small groups which get the less distance between the elements into each group. This can be useful for courier companies, distribution, logistics, etc..